Elsevier Hybrid Open Access Analysis

Publishers rarely make spending on hybrid open access articles transparent. Elsevier is a remarkable exception, because the publisher provides open and machine-readable data about central agreements with funding bodies and fee waivers at the article level. This blogpost demonstrates how to mine these invoice data from Elsevier full-texts with R. Analysing the resulting dataset of 70,660 hybrid open access articles published in around 1,753 journals between 2015 and now reveals that around one third of publication fees were paid through central agreements. Nevertheless, the funding sources for the majority of hybrid open access articles remain unknown, raising important questions about the transparency of fee-based open access publishing.

Najko Jahn (https://twitter.com/najkoja), State and University Library Göttingen (https://www.sub.uni-goettingen.de/)
Aug 29, 2019

In September 2018, cOAlition S, a group of international research funders, announced its widely discussed Plan S. According to its principles, publication fees that may arise when publishing open access, also known as article-processing charges (APC), should be covered by funders or research organizations directly. In the case of hybrid open access, the funders want to support publication fees only through transformative agreements, which are central invoicing agreements aiming at the transition of subscription-based journal publishing to full open access.

Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that many authors already do not pay publication fees themselves, keeping track of these funding streams is challenging, because publishers rarely share invoice data. Moreover, not all funders and research organizations report open access payments to crowd-sourcing initiatives like Open APC (Aasheim et al. 2019). Furthermore, transformative agreements introduce new requirements (Geschuhn and Stone 2017). Altogether, this leads to a lack of transparency that limits the monitoring of hybrid open access in general and of agreements between publishers and funding bodies in particular.

At the SUB Göttingen, we will address this complex situation in a new project funded by the Deutsche Forschungsgemeinschaft (DFG). Building on our pilot, the interactive Shiny app Hybrid OA Journal Monitor, this project will investigate the data needs of German library consortia and how metadata requirements in transformative agreements can meet them. Case studies and data products based on open data will be developed to monitor levels of compliance with policy recommendations. Here, invoice data are essential to make the various funding streams for hybrid open access articles visible.

Against this background, this blogpost presents a dataset comprising publicly available invoice data for hybrid open access articles from Elsevier, a major publisher of scholarly journals. This dataset brings together metadata from Crossref and information retrieved from open access full-texts. The methods used to obtain the data address key challenges in discovering hybrid open access articles, along with funding and affiliation information, using open data and tools. In addition, Elsevier’s effort to make invoice recipients openly available serves as a good-practice example for other publishers offering hybrid open access options and central agreements.

To demonstrate the potential of publisher-provided data for monitoring the transition of journals to open access, the dataset will be used to analyse the number and the proportion of hybrid open access articles in Elsevier journals. Drawing on Elsevier’s funding information, I will also investigate whether Elsevier sent invoices to authors or to funders that made an agreement with Elsevier, or whether the fees were waived. Moreover, text-mined author email domains will provide a rough approximation of the affiliation of the first or corresponding author, an important data point for delineating open access funding.

The resulting dataset is openly available on GitHub along with the source code.

Methods

As a start, I used the Elsevier publication fee price list, an openly available PDF document, to determine the current hybrid open access journals in Elsevier’s journal portfolio. The rOpenSci tabulizer package (Leeper 2018) made it possible to extract data about these journals from this file.

Then, I interfaced the Crossref REST API with the R package rcrossref (Chamberlain et al. 2019). The first API call retrieved facet field counts for license URLs and the yearly article volumes for the period 2015-19 for every journal. After matching Creative Commons license URLs indicating open access articles, a second API call retrieved article-level metadata per journal. Next, I used the delay-in-days metadata field to exclude delayed open access articles. Because of different date formats used for the delay calculation by Crossref, I allowed for a lag of 31 days.
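The exclusion of delayed open access articles can be illustrated with a minimal base R sketch. It is not the code used to build the dataset: the records are invented, the column name `delay_in_days` is an assumed stand-in for the Crossref delay-in-days field, and treating a delay of exactly 31 days as still acceptable is an assumption as well.

```r
# Toy article-level records; DOIs and delay values are invented for illustration.
articles <- data.frame(
  doi = c("10.9999/example-a", "10.9999/example-b",
          "10.9999/example-c", "10.9999/example-d"),
  delay_in_days = c(0, 31, 45, 365),
  stringsAsFactors = FALSE
)

# Maximum tolerated lag (in days) between publication and the open license
# start date, absorbing differences in the date formats used for the delay.
max_lag_days <- 31

# Keep articles that became open access within the tolerated lag.
immediate_oa <- articles[articles$delay_in_days <= max_lag_days, ]
immediate_oa$doi
```

Under these assumptions, the first two toy records are kept, while the articles opened after 45 and 365 days are dropped as delayed open access.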

Crossref metadata includes full-text links for text-mining purposes. Elsevier provides access to full-texts as HTML and XML documents via the Crossref Text and Data Mining Services (Crossref-TDM). Surprisingly, the XML representation not only contains the full-text, but also embedded metadata including information about open access sponsorship in the <core> node:


<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
  BMBF - German Federal Ministry of Education and Research
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
  http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>

Snapshot of open access metadata in Elsevier XML full-texts. https://api.elsevier.com/content/article/PII:S0169409X18301479?httpAccept=text/xml

After downloading the Elsevier full-texts with the crminer package (Chamberlain 2018), I extracted the above-highlighted open access information from the XML documents.
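A minimal sketch of this extraction step, assuming the xml2 package is installed. The fragment below is a simplified, namespace-free version of the snapshot above, and the helper `get_field()` as well as the mapping of `openArchiveArticle` to the `oa_archive` column are illustrative assumptions, not the original code.

```r
library(xml2)

# Simplified fragment modelled on the snapshot above. Real Elsevier documents
# are namespaced and much larger, so they may need xml_ns_strip() or
# namespace-aware XPath expressions first.
doc <- read_xml(
  "<core>
     <openaccess>1</openaccess>
     <openaccessArticle>true</openaccessArticle>
     <openArchiveArticle>false</openArchiveArticle>
     <openaccessSponsorName>
       BMBF - German Federal Ministry of Education and Research
     </openaccessSponsorName>
     <openaccessSponsorType>FundingBody</openaccessSponsorType>
   </core>"
)

# Hypothetical helper: trimmed text of the first node with the given name,
# or NA if the node is absent.
get_field <- function(doc, name) {
  node <- xml_find_first(doc, paste0(".//", name))
  xml_text(node, trim = TRUE)
}

# One row of open access metadata per full-text.
oa_info <- data.frame(
  oa_sponsor_type = get_field(doc, "openaccessSponsorType"),
  oa_sponsor_name = get_field(doc, "openaccessSponsorName"),
  oa_archive = get_field(doc, "openArchiveArticle") == "true",
  stringsAsFactors = FALSE
)
oa_info
```

Returning NA for absent nodes keeps articles without sponsorship information in the table instead of failing on them.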

Moreover, I parsed the first author email address, assuming that email domains roughly indicate the affiliation of the first or corresponding author at the time of publication. The package urltools (Keyes et al. 2019) made it possible to extract email domains and to split them into meaningful parts.
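The idea behind the domain parsing can be sketched in base R. This is a simplified stand-in for what urltools does, using invented email addresses; it only derives the host and the top-level domain.

```r
# Invented addresses for illustration only.
emails <- c("jane.doe@med.cornell.edu", "j.smith@ucl.ac.uk", "someone@gmail.com")

# Everything after the last "@" is treated as the email host.
host <- tolower(sub("^.*@", "", emails))

# The top-level domain is the last dot-separated label of the host.
tld <- sub("^.*\\.", "", host)

parsed <- data.frame(email = emails, host = host, tld = tld,
                     stringsAsFactors = FALSE)
parsed
```

Splitting a host into subdomain, domain and a multi-part suffix such as ac.uk cannot be done reliably with a regular expression alone, because it requires a lookup against the public suffix list. That is the part a dedicated package handles.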

Finally, to measure the overlap between crowd-sourced and publisher-provided invoice data, I downloaded spending data from the Open APC initiative (Aasheim et al. 2019), which is, to my knowledge, the largest evidence-base for institutional spending on open access publication fees.

Dataset characteristics

In the following data analysis, I will be using two files. The first file, journal_facets.json, contains the number of publications per year for each Elsevier journal that offers hybrid open access options. It furthermore provides the various license URLs found through Crossref.

The second file, elsevier_hybrid_oa_df.csv, comprises article-level data. Each row holds information for a single hybrid open access article, and the columns represent:

Variable         Description
doi              DOI
license          Open Content License
issued           Earliest publication date
issued_year      Earliest publication year
issn             ISSN, a journal identifier
journal_title    The title of the journal
journal_volume   Yearly publication volume
tdm_link         Link to the XML full-text
oa_sponsor_type  Invoice recipient type
oa_sponsor_name  Institution that directly received an invoice
oa_archive       Was open access provided through Elsevier’s open archive programme, in which articles are made openly available after an embargo?
host             Email host, e.g. med.cornell.edu
tld              Top-level domain, e.g. edu
suffix           Extracted suffix from domain name as defined by the public suffix list, e.g. ac.uk
domain           Email domain, e.g. cornell.edu
subdomain        Email subdomain, e.g. med

It must be noted, however, that Elsevier did not provide official documentation of its open access and invoice data at the time of writing this blogpost.

Results

In total, 1,753 out of 1,990 Elsevier journals with an open access option published at least one open access article between 2015 and now, corresponding to about 88 %. In these journals, 70,660 hybrid open access articles appeared. The total share of hybrid open access in the publication volume of Elsevier journals was 2.4 %.
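The share of journals with at least one open access article follows directly from the two counts reported above. The variable names in this short check are illustrative.

```r
journals_with_option <- 1990  # Elsevier journals offering an open access option
journals_with_oa <- 1753      # journals with at least one open access article

# Share of journals with at least one hybrid open access article, in percent.
share <- journals_with_oa / journals_with_option * 100
round(share)
```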

What is the uptake of hybrid open access among Elsevier journals?

The hybrid open access share varied across Elsevier journals. Figure 1, which replicates a boxplot aesthetic from The Economist magazine using the ggeconodist package (Rudis 2019), shows a slow but steady uptake of hybrid open access. The median open access proportion was around 3 % in the first eleven months of 2019.

Figure 1: Hybrid open access uptake in Elsevier journals per year in percent, visualized as a diminutive distribution chart. Since 2015, most journals have had a slow uptake rate of hybrid open access. In general, the hybrid open access publishing model played a marginal role compared to Elsevier’s total publication volume. Data Source: Elsevier B.V. / Crossref.
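The yearly median behind such a chart can be sketched as follows. The counts, ISSNs and column names are invented for illustration and are not taken from journal_facets.json.

```r
# Invented yearly counts for three journals; not taken from the dataset.
journals <- data.frame(
  issn = rep(c("0000-0001", "0000-0002", "0000-0003"), times = 2),
  year = rep(c(2018, 2019), each = 3),
  oa_articles = c(2, 5, 10, 3, 8, 12),
  all_articles = c(200, 250, 200, 150, 200, 300)
)

# Hybrid open access share per journal and year, in percent.
journals$oa_share <- journals$oa_articles / journals$all_articles * 100

# Median share across journals for each year.
median_share <- aggregate(oa_share ~ year, data = journals, FUN = median)
median_share
```

Using the median rather than the mean keeps a few journals with unusually high open access shares from dominating the yearly summary.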

How many hybrid open access articles were facilitated by central agreements with Elsevier?

In most cases, Elsevier sent invoices for hybrid open access publication fees to individual authors (59 %). For around 33 % of articles, institutions and funding bodies with a central invoicing agreement paid the publication fee directly for affiliated authors. Elsevier granted publication fee waivers to 6.2 % of hybrid open access articles.
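Shares like these can be derived from the oa_sponsor_type column with a simple tabulation. The sketch below uses invented values: only "FundingBody" appears in the XML snapshot shown earlier, while "Author" and "Waived" are placeholder labels assumed for illustration.

```r
# Invented recipient types; "Author" and "Waived" are placeholder labels.
oa_sponsor_type <- c(
  rep("Author", 6), rep("FundingBody", 3), rep("Waived", 1)
)

# Percentage of articles per invoicing type.
shares <- round(prop.table(table(oa_sponsor_type)) * 100, 1)
shares
```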

Figure 2 shows the annual development per invoicing type. Inspired by Claus O. Wilke’s “Fundamentals of Data Visualization” (Wilke 2019), each type is visualised separately as part of the total. The figure reveals a general growth of hybrid open access articles. It illustrates that this development was mainly driven by individual options to pay for hybrid open access publication fees, while central invoicing stagnated. The number of fee-waived articles also remained more or less constant between 2015 and now.

Figure 2: Development of fee-based hybrid open access publishing in Elsevier journals by invoicing type. Colored bars represent the invoice recipient or indicate that the fee was waived. Grey bars show the total number of hybrid open access articles published by Elsevier journals between 2015 and now. Data Source: Elsevier B.V. / Crossref.

The following interactive visualization, created with the echarts4r package (Coene 2019), lets you browse the invoicing data (Figure 3).

Figure 3: Breakdown of Elsevier hybrid open access journal articles by invoice recipient. Each rectangle represents an invoicing type and can be broken down by recipient. Data Source: Elsevier B.V.

Clicking on “Agreement” shows the funders and library consortia that covered hybrid open access publication fees directly. In total, Elsevier disclosed 74 different funding bodies that received an invoice for open access publication. Not surprisingly, mostly British and Dutch funders or consortia paid for hybrid open access in Elsevier journals. The German Federal Ministry of Education and Research (BMBF) is also well represented, despite the current boycott by most universities and research organizations in Germany (Else 2018). In fact, the BMBF is not part of the Alliance of Science Organisations in Germany, which is behind the boycott. Since 2018, the BMBF has financially supported 181 hybrid open access articles that appeared in 129 Elsevier journals, according to the publisher.

Who published hybrid open access in Elsevier journals?

In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first or corresponding author, a data point used to delineate open access funding.

Figure 4: Email domain analysis of first or corresponding authors publishing hybrid open access in Elsevier journals. Around every fourth article published between 2015 and now was from an author affiliated with a UK-based academic institution. Data Source: Elsevier B.V.

Figure 4 presents a breakdown by email domain suffix. In total, 67,903 email addresses were retrieved and parsed from Elsevier full-texts, corresponding to a share of 96 %. Most corresponding author emails originate from academic institutions in the UK (“ac.uk”), followed by domains from commercial organizations (“com”), and US-American institutions of higher education (“edu”). The figure illustrates that European institutions from Germany (“de”), the Netherlands (“nl”), and Sweden (“se”) were well represented. In total, 330 domain suffixes were retrieved.
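The coverage figure and a suffix ranking of this kind can be reproduced in a few lines. The two counts come from the text above, whereas the suffix vector is invented and merely stands in for the suffix column of the dataset.

```r
# Counts reported in the text.
n_articles <- 70660
n_emails <- 67903

# Share of articles for which an email address could be parsed, in percent.
coverage <- n_emails / n_articles * 100
round(coverage)

# Invented suffix values standing in for the `suffix` column of the dataset.
suffix <- c("ac.uk", "com", "edu", "ac.uk", "de", "ac.uk", "com", "nl")

# Count articles per suffix and sort from most to least frequent.
suffix_counts <- sort(table(suffix), decreasing = TRUE)
head(suffix_counts, 3)
```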

In the following, a hierarchical, interactive treemap visualizes the distribution of the email domains (see Figure 5). While it appears that the distribution of email domains roughly represents the research landscape measured by publications, the dominance of domains from commercial organizations, mostly email providers like “gmail.com” or the Chinese “163.com” and “126.com”, highlights the limitations of this approach to identifying eligible funding institutions from email addresses.

Figure 5: Email domain analysis of first or corresponding authors publishing hybrid open access in Elsevier journals. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. Data Source: Elsevier B.V.

How does Elsevier-provided data compare to spending information from the Open APC initiative?

Discussion and conclusion

Aasheim, Jens Harald, Benjamin Ahlborn, Chelsea Ambler, Magdalena Andrae, Jochen Apel, Hans-Georg Becker, Roland Bertelmann, et al. 2019. Open APC Initiative. Bielefeld University Library. https://github.com/OpenAPC/openapc-de.

Chamberlain, Scott. 2018. Crminer: Fetch ’Scholary’ Full Text from ’Crossref’. https://CRAN.R-project.org/package=crminer.

Chamberlain, Scott, Hao Zhu, Najko Jahn, Carl Boettiger, and Karthik Ram. 2019. Rcrossref: Client for Various ’Crossref’ ’APIs’. https://CRAN.R-project.org/package=rcrossref.

Coene, John. 2019. Echarts4r: Create Interactive Graphs with ’Echarts Javascript’ Version 4. http://echarts4r.john-coene.com/.

Dallmeier-Tiessen, Suenje, Robert Darby, Bettina Goerner, Jenni Hyppoelae, Peter Igo-Kemenes, Deborah Kahn, Simon C. Lambert, et al. 2011. “Highlights from the SOAP Project Survey. What Scientists Think About Open Access Publishing.” http://arxiv.org/abs/1101.5260.

Else, Holly. 2018. “Dutch Publishing Giant Cuts Off Researchers in Germany and Sweden.” Nature 559 (7715): 454–55. https://doi.org/10.1038/d41586-018-05754-1.

Geschuhn, Kai, and Graham Stone. 2017. “It’s the Workflows, Stupid! What Is Required to Make ‘Offsetting’ Work for the Open Access Transition.” Insights: The UKSG Journal 30 (3): 103–14. https://doi.org/10.1629/uksg.391.

Keyes, Os, Jay Jacobs, Drew Schmidt, Mark Greenaway, Bob Rudis, Alex Pinto, Maryam Khezrzadeh, et al. 2019. Urltools: Vectorised Tools for URL Handling and Parsing. https://CRAN.R-project.org/package=urltools.

Leeper, Thomas J. 2018. Tabulizer: Bindings for Tabula PDF Table Extractor Library.

Rudis, Bob. 2019. Ggeconodist: Create Diminutive Distribution Charts. https://gitlab.com/hrbrmstr/ggeconodist.

Solomon, David J., and Bo-Christer Björk. 2011. “Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal.” Journal of the Association for Information Science and Technology 63 (1). Wiley-Blackwell: 98–107. https://doi.org/10.1002/asi.21660.

Wilke, Claus O. 2019. Fundamentals of Data Visualization. O’Reilly. https://serialmentor.com/dataviz/.